Ch. 20 - Simple Seq2Seq

Welcome to week 5. By now you have a solid foundation and know the most important building blocks of modern deep neural networks. In the next few weeks you will gain practical knowledge of more advanced models. You will find them much more compute intensive and more difficult to work with than the classifiers we have looked at so far, but you will also recognize their ability to perform tasks that until very recently were science fiction. We will dive in head first with a sequence-to-sequence model that can translate English to French.

Translation

In 2016, Google announced that it had replaced the entire Google Translate algorithm with a single neural network. The special thing about the Google Neural Machine Translation system is that it translates many languages 'end to end' using only a single model. It works by encoding the semantics of a sentence and then decoding those semantics into the desired output language. The fact that such a system is possible at all baffled many linguists and other researchers, as it shows that machine learning can create systems that accurately capture high-level meaning and semantics without being given any explicit rules. These semantics are represented as an encoding vector, and while we don't yet know how to fully interpret these vectors, there are plenty of useful applications for them. Translating from one language to another is popular, but we could use a similar approach to 'translate' a report into a summary. In this chapter we will follow a similar approach but train a much simpler model that can only translate from English to French. We will use this task to demonstrate a simple sequence-to-sequence (Seq2Seq) model.

Overview

If all phrases had the exact same length, we could simply use an LSTM (or several). Remember that an LSTM can also return a full sequence of the same length as the input sequence. However, in many cases sequences will not have the same length. To deal with phrases of different lengths, we first create an encoder, which aims to capture the sentence's semantic meaning. We then create a decoder, which has two inputs: the encoded semantics and the sequence that has already been produced. The decoder then predicts the next item in the sequence. For our character-level translator this looks like this:

Note how the output of the decoder is fed back in as the input of the decoder. This process only stops once the decoder produces a <STOP> tag that indicates that the sequence is over.

The data

We use a dataset of English phrases and their French translations. We implement this model on the character level, which means that, unlike in previous models, we won't tokenize words but characters. This makes the task harder for our network because it now also has to learn how to spell words! On the other hand, there are far fewer characters than words, so we can simply one-hot encode characters and don't have to work with embeddings. This makes our model a bit simpler. Without much further ado, let's load the data.


In [4]:
from keras.models import Model
from keras.layers import Input, LSTM, Dense
import numpy as np

In [5]:
batch_size = 64  # Batch size for training.
epochs = 100  # Number of epochs to train for.
latent_dim = 256  # Latent dimensionality of the encoding space.
num_samples = 10000  # Number of samples to train on.
# Path to the data txt file on disk.
data_path = 'fra-eng/fra.txt'

Input (English) and target (French) are tab-delimited in the data file. Each row represents a new phrase, with the English text and its French translation separated by a tab (escaped character: \t). So we loop over the lines and read out inputs and targets by splitting each line at the tab symbol.

To build up our tokenizer, we also need to know which characters are present in our dataset. So for all characters we check whether they are already in our set of seen characters and if not add them to it.


In [6]:
# Vectorize the data.
input_texts = []
target_texts = []
input_characters = set()
target_characters = set()

# Loop over lines
lines = open(data_path, encoding='utf-8').read().split('\n')
for line in lines[: min(num_samples, len(lines) - 1)]:
    # Input and target are split by tabs
    # English TAB French
    input_text, target_text = line.split('\t')
    
    # We use "tab" as the "start sequence" character
    # for the targets, and "\n" as "end sequence" character.
    target_text = '\t' + target_text + '\n'
    input_texts.append(input_text)
    target_texts.append(target_text)
    
    # Create a set of all unique characters in the input
    for char in input_text:
        if char not in input_characters:
            input_characters.add(char)
            
    # Create a set of all unique output characters
    for char in target_text:
        if char not in target_characters:
            target_characters.add(char)

print('Number of samples:', len(input_texts))


Number of samples: 10000

Now we build up our tokenizer. While this could be done with the Keras tokenizer, we will just do it manually here. Note that we build a different tokenizer for input and output, as some characters might appear in French but not in English and the other way around.


In [7]:
input_characters = sorted(list(input_characters)) # Sort so the character indices are deterministic
target_characters = sorted(list(target_characters))
num_encoder_tokens = len(input_characters) # i.e. the size of the English alphabet plus digits, punctuation, etc.
num_decoder_tokens = len(target_characters) # i.e. the size of the French alphabet plus digits, punctuation, etc.


print('Number of unique input tokens:', num_encoder_tokens)
print('Number of unique output tokens:', num_decoder_tokens)


Number of unique input tokens: 71
Number of unique output tokens: 93

In [10]:
# This works very similarly to a tokenizer:
# the index maps each character to a number
input_token_index = {char: i for i, char in enumerate(input_characters)}
target_token_index = {char: i for i, char in enumerate(target_characters)}

In [11]:
# Demo character tokenization
for c in 'the cat sits on the mat':
    print(input_token_index[c], end = ' ')


63 51 48 0 46 44 63 0 62 52 63 62 0 58 57 0 63 51 48 0 56 44 63 
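
As an aside, the Keras Tokenizer mentioned earlier can also operate on the character level. The following is only a rough sketch of equivalent functionality and is not used in the rest of the chapter (note that its indices start at 1 rather than 0):


In [ ]:
# Rough character-level equivalent of the manual index above (not used below)
from keras.preprocessing.text import Tokenizer

char_tokenizer = Tokenizer(char_level=True, filters='', lower=False)
char_tokenizer.fit_on_texts(input_texts)

# word_index maps characters to integers, starting at 1
print(char_tokenizer.texts_to_sequences(['the cat sits on the mat']))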

Next, we build up our model's training data. Remember that our model has two inputs but only one output. While the model can handle sequences of any length, it is handy to prepare the data in NumPy, and for that we need to know how long our longest sequence is:


In [8]:
max_encoder_seq_length = max([len(txt) for txt in input_texts]) # Length of the longest input sequence
max_decoder_seq_length = max([len(txt) for txt in target_texts])

print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:', max_decoder_seq_length)


Max sequence length for inputs: 16
Max sequence length for outputs: 59

Now we prepare input and output data for our model.


In [7]:
# encoder_input_data is a 3D array of shape (num_pairs, max_english_sentence_length, num_english_characters) 
# containing a one-hot vectorization of the English sentences.

encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens),
    dtype='float32')

# decoder_input_data is a 3D array of shape (num_pairs, max_french_sentence_length, num_french_characters) 
# containing a one-hot vectorization of the French sentences.

decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')

# decoder_target_data is the same as decoder_input_data but offset by one timestep. 
# decoder_target_data[:, t, :] will be the same as decoder_input_data[:, t + 1, :]

decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')

You can see that the decoder input and the decoder target are the same data, except that the target is one timestep ahead. This makes sense when you consider that we will feed an unfinished sequence into the decoder and want it to predict the next character.


In [8]:
# Loop over input texts
for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    # Loop over each char in an input text
    for t, char in enumerate(input_text):
        # Create one hot encoding by setting the index to 1
        encoder_input_data[i, t, input_token_index[char]] = 1.
    # Loop over each char in the output text
    for t, char in enumerate(target_text):
        # decoder_target_data is ahead of decoder_input_data by one timestep
        decoder_input_data[i, t, target_token_index[char]] = 1.
        if t > 0:
            # decoder_target_data will be ahead by one timestep
            # and will not include the start character.
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.
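
As a quick, optional sanity check (not part of the original pipeline), we can turn the first one-hot encoded sample back into text and compare it with the raw input:


In [ ]:
# Reconstruct the first input text from its one-hot encoding
reconstructed = ''.join(input_characters[np.argmax(step)]
                        for step in encoder_input_data[0]
                        if step.sum() > 0)  # skip all-zero padding timesteps
print(input_texts[0], '->', reconstructed)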

The model - Keras functional API

The avid reader will wonder how we create a model with two inputs. After all, all our models so far had exactly one input and one output. Enter the Keras functional API. So far we have used Sequential models, in which layers get stacked on top of each other when we call model.add(). With the functional API, we have a bit more control and can specify exactly how layers are connected. Let's look at a simple two-layer network built both the Sequential and the functional way:


In [12]:
# Sequential model:
from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()
model.add(Dense(64, input_dim=64))
model.add(Activation('relu'))
model.add(Dense(4))
model.add(Activation('softmax'))
model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 64)                4160      
_________________________________________________________________
activation_1 (Activation)    (None, 64)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 4)                 260       
_________________________________________________________________
activation_2 (Activation)    (None, 4)                 0         
=================================================================
Total params: 4,420
Trainable params: 4,420
Non-trainable params: 0
_________________________________________________________________

In [16]:
# Functional API
from keras.models import Model
from keras.layers import Dense, Activation, Input # Note that Input is a layer now too

model_input = Input(shape=(64,))
x = Dense(64)(model_input)
x = Activation('relu')(x)
x = Dense(4)(x)
model_output = Activation('softmax')(x)

model = Model(model_input, model_output)
model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_2 (InputLayer)         (None, 64)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 64)                4160      
_________________________________________________________________
activation_3 (Activation)    (None, 64)                0         
_________________________________________________________________
dense_4 (Dense)              (None, 4)                 260       
_________________________________________________________________
activation_4 (Activation)    (None, 4)                 0         
=================================================================
Total params: 4,420
Trainable params: 4,420
Non-trainable params: 0
_________________________________________________________________

You can see that the functional API can connect layers in more advanced ways than the Sequential model. We can also separate the layer creation and layer connection steps. Note, however, that once a layer is connected, calling it again on another input does not create a new layer but reuses (shares) its weights; we will take advantage of this later when building the inference models, and a small sketch follows after the next cell.


In [21]:
# Functional API
from keras.models import Model
from keras.layers import Dense, Activation, Input # Note that Input is a layer now too

model_input = Input(shape=(64,))
dense = Dense(64)
x = dense(model_input)
activation = Activation('relu')
x = activation(x)
dense_2 = Dense(4)
x = dense_2(x)
model_output = Activation('softmax')(x)

model = Model(model_input, model_output)
model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_4 (InputLayer)         (None, 64)                0         
_________________________________________________________________
dense_7 (Dense)              (None, 64)                4160      
_________________________________________________________________
activation_6 (Activation)    (None, 64)                0         
_________________________________________________________________
dense_8 (Dense)              (None, 4)                 260       
_________________________________________________________________
activation_7 (Activation)    (None, 4)                 0         
=================================================================
Total params: 4,420
Trainable params: 4,420
Non-trainable params: 0
_________________________________________________________________
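
Calling an already connected layer on a second input, by contrast, does not create a new layer: both connections share the same weights. Here is a minimal sketch of such weight sharing (a standalone example, not part of the translation model):


In [ ]:
# Weight sharing: one Dense layer connected to two different inputs
from keras.models import Model
from keras.layers import Dense, Input

input_a = Input(shape=(64,))
input_b = Input(shape=(64,))

shared_dense = Dense(4)           # a single layer object ...
output_a = shared_dense(input_a)  # ... connected to two inputs,
output_b = shared_dense(input_b)  # sharing its weights between them

shared_model = Model([input_a, input_b], [output_a, output_b])
shared_model.summary()            # the Dense layer's parameters appear only once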

We will use the functional API to create a model with two inputs.

Recall the info graphic from earlier:

You see that the decoder also has two inputs: the decoder inputs and the encoded semantics. The encoded semantics, however, are not the outputs of the encoder LSTM directly, but its states. In an LSTM, the states are the hidden memory of the cells. What happens is that the first 'memory' of our decoder is the encoded semantics. To give the decoder this first memory, we initialize its states with the states of the encoder LSTM.


In [17]:
# Define an input sequence and process it.
encoder_inputs = Input(shape=(None, num_encoder_tokens), name = 'encoder_inputs')

# The return_state constructor argument configures an RNN layer to return a list
# where the first entry is the output and the next entries are the internal RNN states.
# We use this to recover the states of the encoder.
encoder = LSTM(latent_dim, return_state=True, name = 'encoder')

encoder_outputs, state_h, state_c = encoder(encoder_inputs)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens), name = 'decoder_inputs')

# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True, name = 'decoder_lstm')

# The initial_state call argument specifies the initial state(s) of an RNN.
# We use it to pass the encoder states to the decoder as its initial states,
# essentially making the encoded semantics the decoder's first memory.
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)

decoder_dense = Dense(num_decoder_tokens, activation='softmax', name = 'decoder_dense')
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

In [18]:
model.summary()


__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
encoder_inputs (InputLayer)     (None, None, 71)     0                                            
__________________________________________________________________________________________________
decoder_inputs (InputLayer)     (None, None, 93)     0                                            
__________________________________________________________________________________________________
encoder (LSTM)                  [(None, 256), (None, 335872      encoder_inputs[0][0]             
__________________________________________________________________________________________________
decoder_lstm (LSTM)             [(None, None, 256),  358400      decoder_inputs[0][0]             
                                                                 encoder[0][1]                    
                                                                 encoder[0][2]                    
__________________________________________________________________________________________________
decoder_dense (Dense)           (None, None, 93)     23901       decoder_lstm[0][0]               
==================================================================================================
Total params: 718,173
Trainable params: 718,173
Non-trainable params: 0
__________________________________________________________________________________________________

In [ ]:
# Visualize model 
# NOTE: This code requires Graphviz and pydot to run
# The output is also attached in markdown below
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot

SVG(model_to_dot(model).create(prog='dot', format='svg'))

Now it is time to train the model. Use a GPU, since this might take a while otherwise. Alternatively, you can load the provided weights.


In [ ]:
# Run training
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2)
# Save model
model.save('s2s.h5')
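
Since 100 epochs can take a while, it can also be useful to save intermediate weights during training. Below is a minimal sketch using the standard Keras ModelCheckpoint callback; the checkpoint filename is just an example and not part of the original code:


In [ ]:
# Optionally keep the best weights seen so far during training
from keras.callbacks import ModelCheckpoint

checkpoint = ModelCheckpoint('s2s_checkpoint.h5',  # example path
                             monitor='val_loss',
                             save_best_only=True)
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2,
          callbacks=[checkpoint])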

In [12]:
model.load_weights('s2s.h5')

Creating inference models

For inference, we want a different model than the one used for training: encoding and decoding should now be separated into two models. Luckily, the functional API allows us to reuse the layers defined for the training model and retain their trained weights:


In [13]:
# Define encoder model
encoder_model = Model(encoder_inputs, encoder_states)

In [14]:
# Define decoder model

# Inputs from the encoder
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))

# Create a combined memory to input into the decoder
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

# Decoder
decoder_outputs, state_h, state_c = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]

# Predict next char
decoder_outputs = decoder_dense(decoder_outputs)

# The model takes the encoder memory plus its own memory as inputs and spits out 
# a prediction plus its own updated memory to be used for the next char
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)

Translating

We can now start to use our model. First, we create reverse indices that map token numbers back to characters.


In [22]:
# Reverse-lookup token index to decode sequences back to
# something readable.
reverse_input_char_index = {i: char for char, i in input_token_index.items()}
reverse_target_char_index = {i: char for char, i in target_token_index.items()}

When we translate a phrase, we first encode the input. We then loop, feeding the decoder's predicted character and states back into the decoder, until we receive a stop character (in our case the newline character '\n'; the tab character '\t' only marks the start of a sequence).


In [16]:
def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0, target_token_index['\t']] = 1.

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    
    # Loop until we receive a stop sign
    while not stop_condition:
        # Get output and internal states of the decoder 
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)

        # Get the predicted token (the token with the highest score)
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        # Get the character belonging to the token
        sampled_char = reverse_target_char_index[sampled_token_index]
        # Append char to output
        decoded_sentence += sampled_char

        # Exit condition: either hit max length
        # or find stop character.
        if (sampled_char == '\n' or
           len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        # Update states
        states_value = [h, c]

    return decoded_sentence
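
Before translating our own phrases, we can quickly try the decoder on one of the training samples; this cell only uses objects defined above:


In [ ]:
# Try the decoder on the first training sample
seq_index = 0
input_seq = encoder_input_data[seq_index: seq_index + 1]  # keep the batch dimension
print('Input sentence:', input_texts[seq_index])
print('Decoded sentence:', decode_sequence(input_seq))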

Now we can translate English to French! At least for some phrases it works quite well. Given that we did not supply our model with any rules about French words or grammar, this is quite impressive. Translation systems like Google's of course use much bigger datasets and models, but in principle it is the same approach.


In [91]:
my_text = 'Cheers!'
placeholder = np.zeros((1,len(my_text)+10,num_encoder_tokens))

In [92]:
for i, char in enumerate(my_text):
    print(i,char, input_token_index[char])
    placeholder[0,i,input_token_index[char]] = 1


0 C 21
1 h 51
2 e 48
3 e 48
4 r 61
5 s 62
6 ! 1

In [93]:
decode_sequence(placeholder)


Out[93]:
'Santé !\n'
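
To make this more convenient, we could wrap the two steps above in a small helper function. This is just a sketch, not part of the original notebook: it assumes the input only contains characters seen in the English training data (anything else raises a KeyError) and is no longer than max_encoder_seq_length characters.


In [ ]:
# Convenience wrapper: one-hot encode an English string and decode it to French
def translate(text):
    placeholder = np.zeros((1, max_encoder_seq_length, num_encoder_tokens))
    for i, char in enumerate(text):
        # Characters not seen during training would raise a KeyError here
        placeholder[0, i, input_token_index[char]] = 1.
    return decode_sequence(placeholder)

print(translate('Cheers!'))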

Summary

In this chapter you have learned the basics of Seq2Seq models and the Keras functional API. You have built a sophisticated model that can translate English phrases into French. The concepts you learned in this chapter can be extended to other tasks, such as text summarization.